DNA data storage
Trace Reconstruction with Language Models
Weindel, Franziska, Girsch, Michael, Heckel, Reinhard
The general trace reconstruction problem seeks to recover an original sequence from its noisy copies independently corrupted by deletions, insertions, and substitutions. This problem arises in applications such as DNA data storage, a promising storage medium due to its high information density and longevity. However, errors introduced during DNA synthesis, storage, and sequencing require correction through algorithms and codes, with trace reconstruction often used as part of the data retrieval process. In this work, we propose TReconLM, which leverages language models trained on next-token prediction for trace reconstruction. We pretrain language models on synthetic data and fine-tune on real-world data to adapt to technology-specific error patterns. TReconLM outperforms state-of-the-art trace reconstruction algorithms, including prior deep learning approaches, recovering a substantially higher fraction of sequences without error.
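The corruption model described above (independent deletions, insertions, and substitutions) can be sketched as a small simulator. This is only an illustration of how noisy traces arise, not TReconLM or any part of the authors' pipeline; the function names and the default error rates are assumptions chosen for the example.

```python
import random

ALPHABET = "ACGT"


def make_trace(seq, p_del=0.05, p_ins=0.05, p_sub=0.05, rng=None):
    """Return one noisy copy of `seq`.

    Each position is independently hit by an insertion, a deletion, or a
    substitution. The default rates are illustrative assumptions, not
    measured values for any sequencing technology.
    """
    rng = rng or random.Random()
    out = []
    for base in seq:
        # Insertion: emit a random base before the current one.
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))
        r = rng.random()
        if r < p_del:
            # Deletion: drop the current base.
            continue
        if r < p_del + p_sub:
            # Substitution: replace with a different base.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_traces(seq, n, seed=0, **rates):
    """Return `n` independently corrupted copies of `seq`."""
    rng = random.Random(seed)
    return [make_trace(seq, rng=rng, **rates) for _ in range(n)]


if __name__ == "__main__":
    original = "ACGTTGCAACGTACGT"
    for trace in make_traces(original, n=5):
        print(trace)
```

A trace reconstruction algorithm takes a set of such traces as input and tries to recover `original`.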
Reviews: Clustering Billions of Reads for DNA Data Storage
The paper presents a solution to a new type of clustering problem that has emerged from studies of DNA-based storage. Information is encoded within DNA sequences and retrieved using short-read sequencing technology. The sequencer produces multiple short, overlapping reads, which have to be clustered to establish whether they come from the same place in the original sequence. The clustering problem is characterized by clusters that are tight in terms of edit distance (maximum diameter of 25 here, which seems quite broad given current sequencing error rates) but well separated from each other (the distance between clusters is much larger than their diameter). I thought this was an interesting and timely application.
Clustering Billions of Reads for DNA Data Storage
Rashtchian, Cyrus, Makarychev, Konstantin, Racz, Miklos, Ang, Siena, Jevdjic, Djordje, Yekhanin, Sergey, Ceze, Luis, Strauss, Karin
Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. In particular, a crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. Datasets in this domain have many notable properties, such as containing a very large number of small clusters that are well-separated in the edit distance metric space. In this regime, existing algorithms are unsuitable because of either their long running time or low accuracy. To address this issue, we present a novel distributed algorithm for approximately computing the underlying clusters. Our algorithm converges efficiently on any dataset that satisfies certain separability properties, such as those coming from DNA data storage systems. We also prove that, under these assumptions, our algorithm is robust to outliers and high levels of noise. We provide empirical justification of the accuracy, scalability, and convergence of our algorithm on real and synthetic data. Compared to the state-of-the-art algorithm for clustering DNA sequences, our algorithm simultaneously achieves higher accuracy and a 1000x speedup on three real datasets.
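To make the problem setting concrete, here is a minimal sketch of clustering reads by edit distance. It is a naive, single-machine baseline that compares each read against one representative per cluster, not the distributed algorithm the abstract refers to. It relies on the clusters being tight and well separated. The function names and the `radius` threshold are assumptions introduced for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def greedy_cluster(reads, radius):
    """Assign each read to the first cluster whose representative is
    within `radius` edits; otherwise start a new cluster.

    `radius` is a hypothetical threshold. It should sit between the
    largest within-cluster distance and the smallest between-cluster
    distance, which is only possible when clusters are well separated.
    """
    clusters = []  # list of (representative, members)
    for read in reads:
        for rep, members in clusters:
            if edit_distance(read, rep) <= radius:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTGGGG", "ACGACGT"]
    for cluster in greedy_cluster(reads, radius=2):
        print(cluster)
```

In the worst case this performs a number of edit distance computations proportional to the number of reads times the number of clusters, which is far too slow for billions of reads. That cost is the kind of long running time the abstract attributes to existing approaches.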
To make better computers, researchers look to microbiology
March 2, 2017 -- Computer engineers have created some amazingly small devices, capable of storing entire libraries of music and movies in the palm of your hand. But geneticists say Mother Nature can do even better. DNA, where all of biology's information is stored, is incredibly dense. The whole genome of an organism fits into a cell that is invisible to the naked eye. That's why computer scientists are turning to microbiology to design the next best way to store humanity's ever-increasing collection of digital data.